115 research outputs found
A Primer on Causality in Data Science
Many questions in Data Science are fundamentally causal in that our objective
is to learn the effect of some exposure, randomized or not, on an outcome
of interest. Even studies that are seemingly non-causal, such as those with the
goal of prediction or prevalence estimation, have causal elements, including
differential censoring or measurement. As a result, we, as Data Scientists,
need to consider the underlying causal mechanisms that gave rise to the data,
rather than simply the pattern or association observed in those data. In this
work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to
provide an introduction to some key concepts in causal inference. Similar to
other causal frameworks, the steps of the Roadmap include clearly stating the
scientific question, defining the causal model, translating the scientific
question into a causal parameter, assessing the assumptions needed to express
the causal parameter as a statistical estimand, implementing statistical
estimators including parametric and semi-parametric methods, and interpreting
our findings. We believe that using such a framework in Data Science will
help to ensure that our statistical analyses are guided by the scientific
question driving our research, while avoiding over-interpreting our results. We
focus on the effect of an exposure occurring at a single time point and
highlight the use of targeted maximum likelihood estimation (TMLE) with Super
Learner.
Comment: 26 pages (with references); 4 figures
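The Roadmap step of expressing a causal parameter as a statistical estimand can be illustrated for a single time-point exposure. The notation here is an assumption for illustration and is not taken from the abstract: $W$ denotes baseline covariates, $A$ a binary exposure, $Y$ the outcome, and $Y_a$ the counterfactual outcome under exposure level $a$. Under the randomization assumption ($Y_a \perp A \mid W$) and positivity ($0 < P(A=1 \mid W) < 1$), the average treatment effect is identified by the G-computation formula:

```latex
\[
  \underbrace{E[Y_1 - Y_0]}_{\text{causal parameter}}
  \;=\;
  \underbrace{E_W\!\big[\, E(Y \mid A=1, W) - E(Y \mid A=0, W) \,\big]}_{\text{statistical estimand}}
\]
```

The right-hand side depends only on the observed data distribution, so it can be targeted by estimators such as TMLE.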
A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure
We often seek to estimate the impact of an exposure naturally occurring or
randomly assigned at the cluster-level. For example, the literature on
neighborhood determinants of health continues to grow. Likewise, community
randomized trials are applied to learn about real-world implementation,
sustainability, and population effects of interventions with proven
individual-level efficacy. In these settings, individual-level outcomes are
correlated due to shared cluster-level factors, including the exposure, as well
as social or biological interactions between individuals. To flexibly and
efficiently estimate the effect of a cluster-level exposure, we present two
targeted maximum likelihood estimators (TMLEs). The first TMLE is developed
under a non-parametric causal model, which allows for arbitrary interactions
between individuals within a cluster. These interactions include direct
transmission of the outcome (i.e. contagion) and influence of one individual's
covariates on another's outcome (i.e. covariate interference). The second TMLE
is developed under a causal sub-model assuming the cluster-level and
individual-specific covariates are sufficient to control for confounding.
Simulations compare the alternative estimators and illustrate the potential
gains from pairing individual-level risk factors and outcomes during
estimation, while avoiding unwarranted assumptions. Our results suggest that
estimation under the sub-model can result in bias and misleading inference in
an observational setting. Incorporating working assumptions during estimation
is more robust than assuming they hold in the underlying causal model. We
illustrate our approach with an application to HIV prevention and treatment.
Examining Obedience Training as a Physical Activity Intervention for Dog Owners: Findings from the Stealth Pet Obedience Training (SPOT) Pilot Study
Dog training may strengthen the dog–owner bond, a consistent predictor of dog walking behavior. The Stealth Pet Obedience Training (SPOT) study piloted dog training as a stealth physical activity (PA) intervention. In this study, 41 dog owners who reported dog walking ≤3 days/week were randomized to a six-week basic obedience training class or waitlist control. Participants wore accelerometers and logged dog walking at baseline and at 6 and 12 weeks. Changes in PA and dog walking were compared between arms with targeted maximum likelihood estimation. At baseline, participants (39 ± 12 years; females = 85%) walked their dog 1.9 days/week and took 5838 steps/day, on average. At week 6, intervention participants walked their dog 0.7 more days/week and took 480 more steps/day, on average, than at baseline, while control participants walked their dog, on average, 0.6 fewer days/week and took 300 fewer steps/day (difference between arms: 1.3 dog walking days/week; 95% CI = 0.2, 2.5; 780 steps/day, 95% CI = −746, 2307). Changes from baseline were similar at week 12 (difference between arms: 1.7 dog walking days/week; 95% CI = 0.6, 2.9; 1084 steps/day, 95% CI = −203, 2370). Given high rates of dog ownership and low rates of dog walking in the United States, this novel PA promotion strategy warrants further investigation.
Estimating Effects on Rare Outcomes: Knowledge is Power
Many of the secondary outcomes in observational studies and randomized trials are rare. Methods for estimating causal effects and associations with rare outcomes, however, are limited, and this represents a missed opportunity for investigation. In this article, we construct a new targeted minimum loss-based estimator (TMLE) for the effect of an exposure or treatment on a rare outcome. We focus on the causal risk difference and statistical models incorporating bounds on the conditional risk of the outcome, given the exposure and covariates. By construction, the proposed estimator constrains the predicted outcomes to respect this model knowledge. Theoretically, this bounding provides stability and power to estimate the exposure effect. In finite sample simulations, the proposed estimator performed as well as, if not better than, alternative estimators, including the propensity score matching estimator, the inverse probability of treatment weighted (IPTW) estimator, augmented-IPTW, and the standard TMLE algorithm. The new estimator remained unbiased if either the conditional mean outcome or the propensity score was consistently estimated. As a substitution estimator, TMLE guaranteed the point estimates were within the parameter range. Our results highlight the potential for double robust, semiparametric efficient estimation with rare events.
Adaptive Selection of the Optimal Strategy to Improve Precision and Power in Randomized Trials
Benkeser et al. demonstrate how adjustment for baseline covariates in
randomized trials can meaningfully improve precision for a variety of outcome
types. Their findings build on a long history, starting in 1932 with R.A.
Fisher and including more recent endorsements by the U.S. Food and Drug
Administration and the European Medicines Agency. Here, we address an important
practical consideration: *how* to select the adjustment approach -- which
variables and in which form -- to maximize precision, while maintaining Type-I
error control. Balzer et al. previously proposed *Adaptive Prespecification*
within TMLE to flexibly and automatically select, from a prespecified set, the
approach that maximizes empirical efficiency in small trials (N<40). To avoid
overfitting with few randomized units, selection was previously limited to
working generalized linear models, adjusting for a single covariate. Now, we
tailor Adaptive Prespecification to trials with many randomized units. Using
V-fold cross-validation and the estimated influence curve-squared as the loss
function, we select from an expanded set of candidates, including modern
machine learning methods adjusting for multiple covariates. As assessed in
simulations exploring a variety of data generating processes, our approach
maintains Type-I error control (under the null) and offers substantial gains in
precision -- equivalent to 20-43% reductions in sample size for the same
statistical power. When applied to real data from ACTG Study 175, we also see
meaningful efficiency improvements overall and within subgroups.
Comment: 10.5 pages of main text (including 2 tables, 2 figures) + 14.5 pages
of Supporting Info
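The selection step described in this abstract (cross-validation with the squared influence curve as the loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidates are restricted to an unadjusted model and linear working models adjusting for one covariate, the randomization probability is treated as known (0.5), the data are simulated, and all function and variable names (`cv_ic_loss`, `design`, `w1`, `w2`) are hypothetical.

```python
import numpy as np

def design(a, w):
    """Design matrix with intercept, treatment, and optionally one covariate."""
    cols = [np.ones(len(a)), np.asarray(a, dtype=float)]
    if w is not None:
        cols.append(np.asarray(w, dtype=float))
    return np.column_stack(cols)

def cv_ic_loss(y, a, w, p_treat=0.5, n_folds=5, seed=0):
    """Cross-validated mean of the squared influence curve for one candidate.

    Assumption: a linear working model for the outcome and a known
    randomization probability p_treat; the effect is the mean difference.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced fold labels
    loss = np.empty(n)
    for v in range(n_folds):
        train, valid = folds != v, folds == v
        w_tr = None if w is None else w[train]
        w_va = None if w is None else w[valid]
        # Fit the outcome regression on the training folds only.
        beta, *_ = np.linalg.lstsq(design(a[train], w_tr), y[train], rcond=None)
        m = int(valid.sum())
        q1 = design(np.ones(m), w_va) @ beta   # predicted outcome if treated
        q0 = design(np.zeros(m), w_va) @ beta  # predicted outcome if control
        qa = np.where(a[valid] == 1, q1, q0)
        h = a[valid] / p_treat - (1 - a[valid]) / (1 - p_treat)
        # Influence curve evaluated on the held-out fold.
        ic = h * (y[valid] - qa) + (q1 - q0) - np.mean(q1 - q0)
        loss[valid] = ic ** 2
    return float(loss.mean())

# Simulated trial (assumed data-generating process): w1 is prognostic, w2 is noise.
rng = np.random.default_rng(1)
n = 400
w1, w2 = rng.normal(size=n), rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = 1.0 + 0.5 * a + 2.0 * w1 + rng.normal(size=n)

candidates = {"unadjusted": None, "w1": w1, "w2": w2}
losses = {name: cv_ic_loss(y, a, w) for name, w in candidates.items()}
best = min(losses, key=losses.get)  # candidate with the smallest CV loss
print(best, losses)
```

Because the candidate set and the loss are fixed before looking at outcomes by arm, the choice of adjustment is data-adaptive yet prespecified; a fuller version would add machine learning candidates and multiple covariates as the abstract describes.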
Targeted Estimation and Inference for the Sample Average Treatment Effect
While the population average treatment effect has been the subject of extensive methods and applied research, less consideration has been given to the sample average treatment effect: the mean difference in the counterfactual outcomes for the study units. The sample parameter is easily interpretable and is arguably the most relevant when the study units are not representative of a greater population or when the exposure's impact is heterogeneous. Formally, the sample effect is not identifiable from the observed data distribution. Nonetheless, targeted maximum likelihood estimation (TMLE) can provide an asymptotically unbiased and efficient estimate of both the population and sample parameters. In this paper, we study the asymptotic and finite sample properties of the TMLE for the sample effect and provide a conservative variance estimator. In most settings, the sample parameter can be estimated more efficiently than the population parameter. Finite sample simulations illustrate the potential gains in precision and power from selecting the sample effect as the target of inference. As a motivating example, we discuss the Sustainable East Africa Research in Community Health (SEARCH) study, an ongoing cluster randomized trial for HIV prevention and treatment.
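The two targets of inference contrasted in this abstract can be written side by side. The notation is assumed for illustration and does not come from the abstract: $Y_i(a)$ is the counterfactual outcome of study unit $i$ under exposure level $a$, and $n$ is the number of study units.

```latex
\[
  \text{PATE} = E\big[\,Y(1) - Y(0)\,\big],
  \qquad
  \text{SATE} = \frac{1}{n} \sum_{i=1}^{n} \big[\, Y_i(1) - Y_i(0) \,\big]
\]
```

The population effect is an expectation over a target population, whereas the sample effect averages over the $n$ units actually enrolled. Since only one of $Y_i(1)$ and $Y_i(0)$ is observed for each unit, the sample effect depends on unobserved counterfactuals, which is why it is not identifiable from the observed data distribution.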
Examining How Dog ‘Acquisition’ Affects Physical Activity and Psychosocial Well-Being: Findings from the BuddyStudy Pilot Trial
Dog owners are more physically active than non-dog owners, but evidence of a causal relationship between dog acquisition and increased physical activity is lacking. Such evidence could inform programs and policies that encourage responsible dog ownership. Randomized controlled trials are the ‘gold standard’ for determining causation, but they are prohibited in this area due to ethical concerns. In the BuddyStudy, we tested the feasibility of using dog fostering as a proxy for dog acquisition, which would allow ethical random assignment. In this single-arm trial, 11 participants fostered a rescue dog for six weeks. Physical activity and psychosocial data were collected at baseline, 6, and 12 weeks. At 6 weeks, mean change in steps/day was 1192.1 ± 2457.8. Mean changes on the Center for Epidemiologic Studies Depression Scale and the Perceived Stress Scale were −4.9 ± 8.7 and −0.8 ± 5.5, respectively. More than half of participants (55%) reported meeting someone new in their neighborhood because of their foster dog. Eight participants (73%) adopted their foster dog after the 6-week foster period; some maintained improvements in physical activity and well-being at 12 weeks. Given the demonstrated feasibility and preliminary findings of the BuddyStudy, a randomized trial of immediate versus delayed dog fostering is warranted.
Statistical Analysis Plan for Primary and Selected Secondary Health Endpoints of the SEARCH-Youth Study
This document provides the statistical analytic plan (SAP) for evaluating
health outcomes in the SEARCH-Youth study, a cluster randomized trial designed
to evaluate the effect of a combination intervention on HIV viral suppression
among adolescents and young adults with HIV in rural Uganda and Kenya
(Clinicaltrials.gov: NCT03848728). The SAP was locked prior to unblinding and
effect estimation. This SAP was embargoed until November 04, 2022 when it was
submitted to arXiv.
Comment: 14 pages, 1 figure